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Abstract: We propose an approach to multivariate noiiparametric regression that 
generalizes reduced rank regression for linear models. An additive model is estimated 
for each dimension of a q-dimensional response, with a shared p-dimensional predictor 
variable. To control the complexity of the model, we employ a functional form of the 
Ky-Fan or nuclear norm, resulting in a set of function estimates that have low rank. 
Backfitting algorithms are derived and justified using a nonparametric form of the 
nuclear norm subdifferential. Oracle inequalities on excess risk are derived that exhibit 
the scaling behavior of the procedure in the high dimensional setting. The methods are 
illustrated on gene expression data. 

J> \ Keywords and phrases: multivariate regression, nonparametric regression, nuclear 

norm regularization, high dimensional inference, oracle risk bounds. 

1. Introduction 



In the multivariate regression problem the objective is to estimate the conditional mean 

E(Y | X) = m(X) = {m x {X), . . . , m q (X)) T , 

where Y is a q- dimensional response vector, X is a p-dimensional covariate vector, and we 
are given a sample of n i.i.d. pairs {(XW,Yw)} f rom the joint distribution of X and Y. 
This is also referred to as multi-task learning in the machine learning literature. Under a 
linear model, the mean is estimated as m(X) = BX where B e M. qxp is a q x p matrix of 
regression coefficients. When the dimensions p and q are large relative to the sample size n, 
the coefficients of B cannot be reliably estimated, without further assumptions. 

In reduced rank regression the matrix B is estimated under a rank constraint r = 
rank(i?) < C, so that the rows or columns of B lie in an r-dimensional subspace of M. q 
or MP. Intuitively, this implies that the model is based on a smaller number of features than 
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the ambient dimensionality p would suggest, or that the tasks representing the components 
Y k of the response are closely related. In low dimensions, the constrained rank model can be 
computed as an orthogonal projection of the least squares solution; but in high dimensions 
this is not well defined. 

Recent research has studied the use of the nuclear norm as a convex surrogate for the rank 
constraint. The nuclear norm ||-B||*, also known as the trace or Ky-Fan norm, is the sum of 
the singular vectors of B. A rank constraint can be thought of as imposing sparsity, but in an 
unknown basis; the nuclear norm plays the role of the i\ norm in sparse estimation. Its use 
for low rank estimation problems was proposed by Fazel (2002). More recently, nuclear norm 
regularization in multivariate linear regression has been studied by Yuan et al. (2007), and 
by Negahban and Wainwright (2011), who analyzed the scaling properties of the procedure 
in high dimensions. 

In this paper we study nonparametric parallels of reduced rank linear models. We focus our 
attention on additive models, so that the regression function m(X) = (m 1 (X), . . . , m q (X)) T 
has each component m k (X) = Yl P j=i m j(-^-j) ec L ua l to a sum of p functions, one for each 
covariate. The objective is then to estimate the qxp matrix of functions M(X) = [m^(X,-)j . 

The first problem we address, in Section 2, is to determine a replacement for the regular- 
ization penalty in the linear model. Because we must estimate a matrix of functions, the 
analogue of the nuclear norm is not immediately apparent. We propose two related regular- 
ization penalties for nonparametric low rank regression, and show how they specialize to the 
linear case. We then study, in Section 4, the (infinite dimensional) subdifferential of these 
penalties. In the population setting, this leads to stationary conditions for the minimizer 
of the regularized mean squared error. This subdifferential calculus then justifies penalized 
backfitting algorithms for carrying out the optimization for a finite sample. Constrained rank 
additive models (cram) for multivariate regression are analogous to sparse additive models 
(SpAM) for the case where the response is 1-dimensional (Ravikumar et al, 2009) (studied 
also in the reproducing kernel Hilbert space setting by Raskutti, Wainwright and Yu (2012)), 
but with the goal of recovering a low-rank matrix rather than an entry-wise sparse vector. 
The backfitting algorithms we derive in Section 5 are analogous to the iterative smoothing 
and soft thresholding backfitting algorithms for SpAM proposed by Ravikumar et al. (2009). 
A uniform bound on the excess risk of the estimator relative to an oracle is given Section 7. 
This shows the statistical scaling behavior of the methods for prediction. The analysis re- 
quires a concentration result for nonparametric covariance matrices in the spectral norm. 
Experiments with synthetic, gene, and biochemistry data are given in Section 8, which are 
used to illustrate different facets of the proposed nonparametric reduced rank regression 
techniques. 
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2. Nonparametric Nuclear Norm Penalization 



We begin by presenting the penalty that we will use to induce nonparametric regression 
estimates to be low rank. To motivate our choice of penalty and provide some intuition, sup- 
pose that . . . , f q (x) are q smooth one-dimensional functions with a common domain. 
What does it mean for this collection of functions to be low rank? Let 

(n) be 

a collection of points in the common domain of the functions. We require that the n x q 
matrix of function values ¥(x^ 1:n ^) = [f k (x^)] is low rank. This matrix is of rank at most 
r < q for every set {x^} of arbitrary size n if and only if the functions {f k } are r-linearly 
independent — each function can be written as a linear combination of r of the other functions. 

In the multivariate regression setting, but still assuming the domain is one-dimensional 
for simplicity (q > 1 and p = 1), we have a random sample . . . , X^ n \ Consider the 

n x q sample matrix M = [m h (X^)} associated with a vector M = (m 1 , . . . , m q ) of q smooth 
(regression) functions, and suppose that n > q. We would like for this to be a low rank 
matrix. This suggests the penalty 



s=l 



where {\ S (A)} denotes the eigenvalues of a symmetric matrix A and {a s (B)} denotes the 
singular values of a matrix B. Now, assuming the columns of M are centered, and E[m fc (X)] = 
for each k, we recognize ^M T M as the sample covariance E(M) of the population covariance 

S(M) := Cov(M(X)) = [E{m k (X)m l {X))]. 

This motivates the following sample and population penalties, where A 1 ! 2 denotes the matrix 
square root: 

population penalty: ||S(M) 1/2 ||, = || Cov(M(X)) 1/2 ||, (2.1) 

sample penalty: ||S(M) 1/2 ||, = -L||M||*. (2.2) 

Jn 



We will also use the notation |||M|||* = || Cov(M(X)) 1 / 2 ||*. 

This leads to the following population and empirical regularized risk functionals for low 
rank nonparametric regression: 

population penalized risk: ^E||Y - M(X) || 2 + A||S(M) 1/2 1|* (2.3) 

empirical penalized risk: — \\Y - M\\ 2 F + -4=||M||*, (2.4) 

2n ^/n 

where, in the empirical case, Y denotes the n x q matrix of response values for the sample 
{(A"W,yW)}. We recall that if A >z has spectral decomposition A = UDU T then A 1 ' 2 = 
UD 1 ' 2 ^. 
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3. Constrained Rank Additive Models (cram) 



We now consider the case where X is p- dimensional. Throughout the paper we use super- 
scripts to denote indices of the g-dimensional response, and subscripts to denote indices 
of the p-dimensional covariate. We consider the family of additive models, with regression 
functions of the form 

m(X) = (m l (X), . . . , m«(X)) T = £ M,(X 3 ), 

j'=i 

each term Mj(Xj) = (rrij(Xj), . . . ,m'j(Xj)) T being a g-vector of functions evaluated at Xj. 

In this setting we propose two different penalties. The first penalty, intuitively, encourages 
the vector (m^(Xj), . . . , m^(Xj)) to be low rank, for each j. Assume that the functions 
m,j(Xj) all have mean zero; this is required for identifiability in the additive model. As 
a shorthand, let Sj = £(Mj) = Cov(M ? (X J )) denote the covariance matrix of the j-th 
component functions, with sample version E 3 -. The population and sample versions of the 
first penalty are then given by 



1/21 



+ s 



1/21 



+ ••• + 



1/2 I 



U/2| 



(3.1) 
(3.2) 



The second penalty encourages the set of q vector-valued functions (m^, m*, . . . , rrip) T to be 
low rank. This penalty is given by 



1/2 



sy 2 ) 



(g;/ 2 ...gf) 

where, for convenience of notation, 



n 



L l:p\ 



T 



(3.3) 
(3.4) 



n-.p 



(M.J ■ ■ ■ Ml) ' is an np x q matrix. The corre- 
sponding population and empirical risk functionals, for the first penalty, are then 

v o P 



1 



E 



y-J2mj(x) + a£||e 



i 

2n 



v 

3=1 



a/2 1 



i=i 
p 



+ 



7 = 1 



(3.5) 



(3.6) 



1. In the linear case we have 



and similarly for the second penalty. 

Now suppose that each Xj is normalized so that EpT 2 ) 
Mj(Xj) = XjBj where Bj e W. Let B = (B 1 ■ ■ ■ B p ) e R qxp . Some straightforward calcula- 
tion shows that the penalties reduce to 



1/2 



r l/2| 



= \\Bj\\2 

= \\B\L. 



(3.7) 
(3.8) 
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Thus, in the linear case the first penalty is encouraging B to be column-wise sparse, so 
that many of the BjS are zero, meaning that Xj doesn't appear in the fit. This is a version 
of the group lasso (Yuan and Lin, 2006). The second penalty reduces to the nuclear norm 
regularization \\B\\* used for high-dimensional reduced-rank regression. 



4. Subdifferentials for Functional Matrix Norms 

A key to deriving algorithms for functional low-rank regression is computation of the sub- 
differentials of the penalties. We are interested in (q x p)-dimensional matrices of functions 
F = [fj]- For each column index j and row index k, /* is a function of a random variable 
Xj, and we will take expectations with respect to Xj implicitly. We write Fj to mean the jth 
column of F, which is a q- vector of functions of Xj. We define the inner product between 
two matrices of functions as 

3=1 k=l j=l 

and write ||-F|| 2 = \/ ([F, F}}. Note that \\F\\ 2 equals the Frobenius norm of a/E(FF t ) where 
K(FF T ) = b is a positive semidefinite 5x5 matrix. 

We define two further norms on a matrix of functions F, namely, 



F||| sp := A /||E(FF T )|| sp = y/E(FF T ) and |||F|||* := \\y/E(FF 



where ||A|| is the spectral norm (operator norm), the largest singular value of A, and it is 
convenient to write the matrix square root as \/A = A 1 / 2 . Each of the norms depends on F 
only through K(FF T ). In fact, these two norms are dual — for any F, 

\\\F\\U= sup {G,F}, (4.2) 

II|G||| SP <1 

where the supremum is attained by setting G = ^E(FF T ) j F, with A" 1 denoting the 
matrix pseudo-inverse. 

Proposition 4.1. The subdifferential o/|||F|||* is the set 

S{F) := {(VE(FFT))~V + # : |||#||| sp < 1, E(FH T ) = qxg , K(FF T )H = qxp a.e.j . 

(4.3) 

Proof. The fact that S(F) contains the subdifferential can be proved by comparing 

our setting (matrices of functions) to the ordinary matrix case; see Recht, Fazel and Parrilo 
(2010); Watson (1992). Here, we show the reverse inclusion, S{F) C <9|||F|||*. Let D G S{F) 
and let G be any element of the function space. We need to show 

111^ + > \\\F\\U + {(G,D)} , (4.4) 
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where D = \^^E(FF T )j F + H =: F + H for some H satisfying the conditions in (4.3) 
above. Expanding the right-hand side of (4.4), we have 



+ ((G,D) = |||F|||* + (G, F + H)) 
= {(F + G,F + H}} 
< IH-F + G|||*|||£)||| S p , 

where the second equality follows from = ((F,F)), and the fact that ([F,H) 

tr (E(Fif T )) = 0. The inequality follows from the duality of the norms. 
Finally, we show that |||-D||| sp < 1. We have 

E(DD T ) = E(FF T ) + E(FH T ) + E(HF T ) + E(HH T ) 
= E(FF T ) + E(HH T ) , 



(4.5) 
(4.6) 
(4.7) 



(4.8) 
(4.9) 



where we use the fact that E(FH T ) = qxq , implying E(FH T ) = qxq . Next, let E(FF T ) = 
VDV T be a reduced singular value decomposition, where D is a positive diagonal matrix of 
size q' < q. Then E(FF T ) = VV T , and we have 

E(FF T ) ■ H = qxp a.e. <^> V T H = q > Xp a.e. <=> E(FF T )H = qxp a.e. . 

This implies that E(FF T ) ■ E(HH T ) = qxq and so these two symmetric matrices have 
orthogonal row spans and orthogonal column spans. Therefore, 



\E(DD~ 



isp 



= max 
< 1 , 



E(FF T ) + E^HH 1 
E(FP 



sp 

E(HH~ 



sp 



isp 



(4.10) 

(4.11) 
(4.12) 



where the last bound comes from the fact that |||-F||| sp , |||f/"||| sp < 1. Therefore |||Z)||| sp < 1. □ 

This gives the subdifferential of penalty 2, defined in (3.3). We can view the first penalty 
update as just a special case of the second penalty update. For penalty 1 in (3.1), if we are 
updating Fj and fix all the other functions, we are now penalizing the norm 



E(F^ 



(4.13) 



which is clearly just a special case of penalty 2 with a single q- vector of functions instead of 
p different g-vectors of functions. So, we have 



dm 



j^E^-F/)) 1 F j +H i : |||^|||sp<l, E(F j Hj) = 1 E(FjFj)Hj = a.e.| 
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5. Stationary Conditions and Backfitting Algorithms 

Returning to the base case of p = 1 covariate, consider the population regularized risk 
optimization 

mm{±E\\Y-M(X)\\l + \\\\M\l}, (5.1) 

where M is a vector of q univariate functions. The stationary condition for this optimization 
is 

E(Y | X) = M(X) + XV(X) a.e. for some V G <9|||M|||*. (5.2) 
Define P{X) := E(Y \ X). 

Proposition 5.1. Let E(PP T ) = U diag(r)[/ T be the singular value decomposition and 
define 

M = U diag([l - \/y/r\ + )U T P (5.3) 

where [x} + = max(x,0). Then M satisfies stationary condition (5.2), and is a minimizer of 
(5.1). 

Proof. Assume the singular values are sorted as T\ > r 2 > • ■ ■ > r q , and let r be the largest 
index such that ^T r > A. Thus, M has rank r. Note that ^/E(MM T ) = U diag([</F- \]+)U T , 
and therefore 

A( V / E(MM T ))" 1 M = f/diag(A/ v ^,0 <? _ r )[/ T P (5.4) 
where %i± = (xi, . . . , x^) and Ck = (c, . . . , c). It follows that 

M + X(^/E(MM T )) ~ l M = U diag(l r , q _ r )U T P. (5.5) 

Now define 

H = -U diag(O r , V r )£/ T P (5.6) 
A 

and take V = (^/E(MM T ))~ 1 M + H. Then we have M + XV = P. 

It remains to show that H satisfies the conditions of the subdifferential in (4.3). Since 

VEjHiF) = f/diag(O r , yWA, . . . , ^T q /X)U T (5.7) 
we have |||-f/"||| S p < 1- Also, E(MP T ) = qxq since 

diag(l - A/ y/n7 n q _ r ) diag(O r , 1,_ P / A) = 0, x ,. (5.8) 
Similarly, E(MM T )P = qxq since 

diag((v/n^ - A) 2 , g _ r ) diag(O r , l 9 _ r / A) = qxq . (5.9) 

It follows that V E 0|||M |||sp. □ 



7 



CRAM BACKFITTING ALGORITHM — FIRST PENALTY 

Input: Data matrices X G M. nxp and Y G M nX9 , regularization parameter A > 0. 
Initialize Mj = (mJ(X y )) = G M nX9 , for j = l,...,p. 
Iterate until convergence: 

For each j = 1, . . . ,p: 

(1) Compute the residual Zj <— Y — X^y^''- 

(2) Estimate Pj = ~E[Zj \ Xj] by smoothing: Pj = SjZj. 

(3) Compute SVD: \P]Pj = diag(r)L7 T . 

(4) Soft threshold: M,- <- Pj C/diag([l - A/v^ + )f/ T . 

(5) Center: Mj <- ML,- - mean(M,). 

Output: Component functions Mj and estimates of the conditional mean vector . My. 

Fig 1 . TTie CRAM backfitting algorithm, using the first penalty, which penalizes each component. 



The analysis above justifies a backfitting algorithm for estimating a constrained rank 
additive model with the first penalty, where the objective is 



minf-E 

Mj 12 



Y 



3=1 



|M, 



(5.10) 



For a given coordinate j, we form the residual Zj = Y — J2k^jMk, and then compute the 
projection Pj = ~E(Zj \ Xj), with singular value decomposition K(PjPj) = U diag(r)£7 T . We 
then update 

Mj = L7diag([l - \/^} + )U T Pj (5.11) 

and proceed to the next variable. This is a Gauss-Seidel procedure that parallels the popu- 
lation backfitting algorithm for SpAM (Ravikumar et ai, 2009). 

In the sample version we replace the conditional expectation Pj = K(Zj | Xj) by a non- 
parametric linear smoother, Pj = SjZj. The algorithm is given in Figure 1. The algorithm for 
penalty 2 is similar and given in Figure 2. Both algorithms can be viewed as functional pro- 
jected gradient descent procedures. Note that to predict at a point x not included in the train- 
ing set, the smoother matrices are constructed using that point; that is, Pj(xj) = Sj(xj) T Zj. 



6. Working over an RKHS 

Suppose that the functions m 1 ? are required to lie in a Hilbert space %j. A modified empirical 
optimization is (for the first penalty) 

p ■ 2 " " " 

F 



Y-J2^3 

3=1 



3=1 



j=i k=i 



(6.1) 
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CRAM BACKFITTING ALGORITHM — SECOND PENALTY 



Input: Data matrices X G W ixp and Y G M nX9 , regularization parameter A > 0. 
Initialize Mj = (mJ(X y )) = G M nX9 , for j = l,...,p. 
Iterate until convergence: 

For each j = 1, . . . ,p: 

(1) Compute the residual Zj <— Y — ^ji^M-f. 

(2) Estimate Pj = 7sL[Zj \ Xj] by smoothing: Pj = SjZj. 

(3) Compute SVD: ^P^Up = U diag(r)f/ T . 

(4) Soft threshold: M 1:p <- P 1:p C/diag([l - X/y/¥] + )U T . 

(5) Center: Mj <- ML,- - mean(Mj). 

Output: Component functions Mj and estimates of the conditional mean vector . My. 

Fig 2. TTie CRAM backfitting algorithm, using the second penalty, which penalizes the components together. 

where Y is an n x q data matrix and Mj is an n x q matrix of function values associated 
with the jth columns of a n x p data matrix X. The first penalty is then a nuclear-norm 
constraint on these observed function values. The second penalty is a smoothness penalty 
on each of the coordinate functions in the appropriate Hilbert space, and is not empirical. 

If Tij is an RKHS with kernel Kj, then the representer theorem implies that we can 
restrict to functions of the form 



•), (6-2) 



where the afj are real weights. In this case the optimization becomes a finite dimensional 
semidefmite program over a. This parallels the approach of Raskutti, Wainwright and Yu 
(2012) for sparse additive models; see also Dinuzzo and Fukumizu (2012). 

If Kj = [Kh^XijjXi'j)] G M. nxn denotes the Gram matrix for the jth variable, then 
Fj = KjCtj where atj = [a^-j G IR nx<7 . Using the first penalty, the convex optimization is then 



iiiin — Y - KjOtj ^ + X n HifjOfjll, + p n ^JaJfKjO^ (6.3) 

j j=l k=l j=l 

where the third term is a smoothness penalty for the RKHS. This is a cone program with 
constraints involving both the second-order cone and the semidefinite cone. 

7. Excess Risk Bounds 

The population risk of a q x p regression matrix M(X) = [Mi(Xx) ■ ■ ■ M p (X p )) is 

R(M) =E\\Y - M(X)l p \\l 
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with sample version denoted R(M). Consider all models that can be written as 

M{X) = U ■ D ■ V(X) T 

where U is an orthogonal qxr matrix, D is a positive diagonal matrix, and V(X) = [vj S (Xj)\ 
satisfies E(y T K) = I r . The population risk can be reexpressed as 



R(M) = tr 



DU 



E 



tr 



DU 



Y 

v{xy 

Eyy £yy 
Syy Syy 



Y 
V{X) 



DU 



DU T 



and similarly for the sample risk, with S n (V) replacing S(V) := Cov((Y, V(X) T )) above. The 
"uncontrollable" contribution to the risk, which does not depend on M, is R u = tr{Syy}. 
We can express the remaining "controllable" risk as 



R C {M) = R(M) - R u = tr 



-21, 
DU T 



0, 



Using the von Neumann trace inequality, tr(AB) < ||A||p||5|| p / where l/p + 1/p' = 1, 



i? c (M) - R C {M) < 



< 



-2I q 
DU 1 

-2I q 
DU 1 



(E(V)-En(V)) 



E(y) - E n (y) 











sp 





l-Dl 



sp 



sp 



<Cmax(2,||D|| sp ) - S n (V) 



I-DI 



< Cmax{2, pllj} 



sp 



(7.1) 



where here and in the following C is a generic constant. For the last factor in (7.1), it holds 
that 



sup 

v 



E(V) - E n (V) < C sup sup w T ( E(V) - £ n (V) ) w 



where M is a 1/2-covering of the unit (q + r)-sphere, which has size \J\f\ < 6 q+r < 36 g ; 
compare Vershynin (2012, p. 665). We now assume that the functions v S j(xj) are uniformly 
bounded from a Sobolev space of order two. Specifically, let {ipjk '■ k = 0, 1, . . .} denote a 
uniformly bounded, orthonormal basis with respect to L 2 [0, 1], and assume that v s j G Hj 
where 

oo oo 



k=0 



k=0 
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for some < K < oo. The Loo-covering number of 7ij satisfies log Af(Hj, e) < Kj y/e. 

Suppose that Y — E(Y \ X) = W is Gaussian and the true regression function E(F | X) is 
bounded. Then the family of random variables Z(y, w ) '■= \/n • w T (Y l (V) — S n (V))w is sub- 
Gaussian and sample continuous. It follows from a result of Cesa-Bianchi and Lugosi (1999) 
that 



E fsup sup w T (Z(V) - t n (V))w] < [ Wglog(36) + log(pg) + ^= de 
\ V weAT J V n Jo V V e 



for some constant B. Thus, by Markov's inequality we conclude that 

E{V)-E n (V) 



sup 

V 



O, 



S]) 



q + logipq) 



n 



(7.2) 



when n tends to infinity and q and p possibly change with n. If 

1/4 

/ \ 

O 



D 



(q + \og{pq)) 



then returning to (7.1), this gives us a bound on R C (M)—R C (M) that is op(l). More precisely, 
for fi n > 0, define the class of matrices of functions 

M(P n ) = {M : M(X) = UDV(X) T , with E(V T V) = I, v sj G Hj, \\D\\ m < f3 n } . (7.3) 

Then, for a fitted matrix M chosen from M.{f3 n ), writing = arg imn MeM (p n \ R(M), we 
have 

R(M) - inf R(M) = R(M) - R(M) - (R(M*) - R(M*)) + (R(M) - R(M*)) 

M£M(Pn) 



< [R(M] - R(M)] - [R(MJ - R(M*) 
Subtracting R u — R u from each of the bracketed differences, we obtain that 

R(M) - inf R(M) < \R C (M) - RAM)] - \R C (M*) - R C (M* 
MeM(i3 n ) 



< 2 sup \ R C (M) - R C (M) 

MeM(/3„) 



by (7.1) 

< o P 



Now if 

then we may conclude from (7.2) that 



\D\ 



E{V) - E n (y) 



sp 



n 



q + \og(pq) 



1/4 



(7.4) 



R(M)- inf ■ R(M)=o P (l). 

MeM(p n ) 



This proves the following result. 
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Penalty 1 Penalty 2 




Fig 3. Example of the constrained-rank backfitting algorithm, with q = 3 and p — 4. The model is the same 
for each k; that is, rrij — rrij for each j,k,k'. The left plots show the fits with penalties <Vj|||-Mj|||*, with 
X = (3,3,0,0). The right plots use penalty 2, and the solution has rank 1. 

Proposition 7.1. Let M minimize the empirical risk - £\ — £\ Mj(Xy) ||| over the 
class Ai(/3 n ). Suppose that Y — E(Y | X) is Gaussian, the true regression function K(Y \ X) 
is bounded, and /3 n satisfies (7.4) as n — >■ oo. Then it holds that 

R(M) - inf R(M) -A . 

MeM(/3 n ) 

8. Examples 

8.1. Synthetic Data Example 

Figure 3 shows an example of the backfitting algorithms for penalties 1 and 2. For this 
example q = 3 and p = 4. The model is the same for each k; that is, rrij = rrij for each j, k, k'. 
The true regression functions are m\(x) = sin(2x), m\{x) = x 2 — C2, m^(x) = x, and m^(x) = 
e~ x — c 4 , where c 2 and C4 are centering constants. The left group of plots in Figure 3 shows the 
fits obtained with penalties Aj|||Mj|||* with regularization parameters A = (Ai, A2, A3, A4) = 
(3,3,0,0), so that the third and fourth functions are not regularized. The plots show how 
the first and second function estimates are identical up to multiplicative scaling for each 
k — 1, 2, 3 (Mi and M 2 have rank one), while the third and fourth function estimates vary 
(M3 and M4 have rank three). The fits are made with local linear smoothing. The right 
set of plots corresponds to penalty 2, and the solution has rank 1. We intentionally use a 
bandwidth that is too small, to better show the differences with and without regularization. 
The true regression functions m k - are shown in blue in the plots; the fitted functions are in 
red, with the fits for a given j superimposed on the same plot. The sample size is n = 150, 
and the noise variance is a 2 = 1. 

8.2. Gene Expression Data 

To further illustrate the proposed nonparametric reduced rank regression techniques, we con- 
sider data on gene expression in E. coli from the "DREAM 5 Network Inference Challenge" 1 

^ttp : //wiki . c2b2 . Columbia. edu/dream/ index .php/D5c4 
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Fig 4. fiis on the gene expression data. Left: Penalty 
via 10-fold cross-validation. Plotted points are residua 



Penalty 2, A = 5 




with large tuning parameter. Right: Penalty 2 tuned 
holding out the given predictor. 



(Marbach et ai, 2012). In this challenge genes were classified as transcription factors (TFs) 
or target genes (TGs). Transcription factors regulate the target genes, as well as other TFs. 

We focus on predicting the expression levels Y for a particular set of q = 27 TGs, using 
the expression levels X for p = 6 TFs. Our motivation for analyzing these 33 genes is that, 
according to the gold standard gene regulatory network used for the DREAM 5 challenge, 
the 6 TFs form the parent set common to two additional TFs, which have the 27 TGs as 
their child nodes. In fact, the two intermediate nodes d-separate the 6 TFs and the 27 TGs 
in a Bayesian network interpretation of this gold standard. This means that if we treat the 
gold standard as a causal network, then up to noise, the functional relationship between X 
and Y is given by the composition of a map g : M 6 — > M. 2 and a map h : M 2 — > M 27 . If g and h 
are both linear, their composition h o g is a linear map of rank at most than 2. As observed 
in Section 2, such a reduced rank linear model is a special case of an additive model with 
reduced rank in the sense of penalty 2. More generally, if g is an additive function and h is 
linear, then hog has rank at most 2 in the sense of penalty 2. Higher rank can in principle 
occur under functional composition, since even a univariate additive map h : R — > 1R 9 may 
have rank up to q under our penalties (penalty 1 and 2 coincide for univariate maps). 

The backfitting algorithm of Figure 1 with penalty 1 and a rather aggressive choice of the 
tuning parameter A produces the estimates shown in Figure 4, for which we have selected 
three of the 27 TGs. Under such strong regularization, the 5th column of functions is rank 
zero and, thus, identically zero. The remaining columns have rank one; the estimated fitted 
values are scalar multiples of one another. We also see that scalings can be different for 
different columns. The third plot in the third row shows a slightly negative slope, indicating a 
negative scaling for this particular estimate. The remaining functions in this row are oriented 
similarly to the other rows, indicating the same, positive scaling. This property characterizes 



13 



Penalty 2, A = 1 

weight volume specific gravity 




No regularization 

weight volume snecitic gravity 










Fig 5. Fzis on i/ie biochemistry data with penalty 2 (left), with A = 1. XTie solution has rank three, with 
regression functions for creatinine, phosphate and phosphorous identical up to scaling. The fits on the right 
have no regularization, and are full rank. 



the difference between penalties 1 and 2; in an application of penalty 2, the scalings would 
have been the same across all functions in a given row. 

Next, we illustrate a higher-rank solution for penalty 2. Choosing the regularization pa- 
rameter A by ten- fold cross-validation gives a fit of rank 5, considerably lower than 27, the 
maximum possible rank. Figure 4 shows a selection of three coordinates of the fitted func- 
tions. Under rank five, each row of functions is a linear combination of up to five other, 
linearly independent rows. We remark that the use of cross-validation generally produces 
somewhat more complex models than is necessary to capture an underlying low-rank data- 
generating mechanism. Hence, if the causal relationships for these data were indeed additive 
and low rank, then the true low rank might well be smaller than five. 



8.3. Biochemistry Example 

Here we analyze the same biochemical data of Smith et al. (1962) that was used by Yuan et al. 
(2007). The data contain chemical measurements for 33 individual samples of men's urine 
specimens. The q = 5 response variables are pigment creatinine, and the concentrations 
(in mg/ml) of phosphate, phosphorous, creatinine and choline. The p = 3 covariates are 
the weight of the subject, and volume and specific gravity of the specimen. Yuan et al. 
(2007) form a linear model where the coefficients in a spline basis are regularized. Their 
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plots suggest some boundary effects, due to the choice of basis. Here we use our backfitting 
algorithm with local linear smoothing. No explicit basis is used. Figure 5 shows the result 
of using regularization A|||Mi :p |||* with A = 1 (penalty 2), and bandwidth h — .3 for all 
variables. The regularized solution has rank 3, where the response variables for phosphorous, 
phosphate, and creatinine concentrations are scaled versions of each other. The plots on the 
right show fits with no regularization (A = 0). 

9. Summary 

This paper introduced two penalties that induce reduced rank fits in multivariate additive 
nonparametric regression. Under linearity, the penalties specialize to group lasso and nuclear 
norm penalties for classical reduced rank regression. Examining the subdifferentials of each 
of these penalties, we developed backfitting algorithms for the two resulting optimization 
problems that are based on soft-thresholding of singular values of smoothed residual matrices. 
The algorithms were demonstrated on a gene expression data set and a biochemical data 
set that have low-rank structure. We also provided a persistence analysis that shows error 
tending to zero under a scaling assumption on the sample size n and the dimensions q and p. 
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